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Summary. Variational Bayes computational methods are attracting increasing in- 
terest because of their ability to scale to large data sets. Here, application of the 
non-conjugate variational message passing (NCVMP) algorithm to generalized lin- 
ear mixed models (GLMMs) is considered. Extending recently developed meth- 
ods in stochastic variational inference to the non-conjugate case, we combine 
NCVMP updates with stochastic natural gradient optimization of the variational 
lower bound to derive a stochastic NCVMP algorithm for fitting GLMMs, scalable 
to large data sets. We demonstrate that convergence for moderate-sized data sets 
can be accelerated by using stochastic NCVMP initially before switching to stan- 
dard NCVMP. Finally, we show that the way variational message passing updates 
separate into messages from above and below a node in a hierarchical model fa- 
cilitates an automatic computation of diagnostics for prior-likelihood conflict in the 
NCVMP algorithm which are very useful for model criticism. 
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1. Introduction 



Variational Bayes (VB) (Attias 2000) computational methods are very attrac- 



tive for Bayesian inference with large data sets. Nevertheless, for very large data 
sets, batch variational algorithms for many common models become increasingly 
inefficient as local variational parameters associated to each unit in the data set 



have to be updated at every iteration (Hoffman et al. 2010). They are also un- 



suitable in online settings where data arrive continuously as the algorithm can 
never complete one iteration. On the other hand, stochastic approximation (SA) 
methods (Robbins and Monro 1951) are able to make use of gradient estimates 



from mini-batches of data so that computational cost is reduced. Recently, |Tan| 
and Nott (2012) showed how to implement an algorithm called non-conjugate 
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variational message passing (NCVMP) (Knowles and Minka, 2011) for infer- 
ence in generalized linear mixed models (GLMMs). Similar to other batch VB 
algorithms for models with observation specific latent variables, the NCVMP 
algorithm for GLMMs has to iterate between updating local variational param- 
eters associated with individual observations and global variational parameters. 
We propose to combine NCVMP updates with stochastic natural gradient opti- 
mization of the variational lower bound when fitting GLMMs to large data sets. 
In addition, we show that diagnostics for prior-likelihood conflict can be obtained 
as an automatic by-product of NCVMP. Our paper makes three contributions. 
First, we extend stochastic variational inference to non-conjugate models and 
derive a stochastic NCVMP algorithm for fitting GLMMs, scalable to large data 
sets. Second, we show that convergence for moderate-sized data sets can be ac- 
celerated by using stochastic NCVMP in the first few iterations before switching 
to standard NCVMP. Third, we show that the way variational message pass- 
ing (VMP) updates separate into messages from above and below a node in a 
hierarchical model facilitates an automatic computation of diagnostics for prior- 
likelihood conflict useful for model criticism. Our "mixed message" diagnostics 
can be shown to approximate existing diagnostics in the statistical literature, 
namely the conflict diagnostics of Marshall and Spiegelhalter (2007). 

VMP is an algorithmic implementation of VB developed by |Winn and Bishop| 
(2005) which can be applied to a very general class of conjugate-exponential 
models. VMP proceeds by passing messages between nodes in a directed graph 
and the posterior distribution associated with a particular node can be up- 
dated using only local operations at the node. Knowles and Minka (2011 ) devel- 
oped NCVMP to extend VMP to non-conjugate models, allowing more choice 
in the computation of expectations in the variational lower bound. |Tan and] 



Nott (2012 1 recently applied NCVMP to GLMMs using a partially non-centered 



parametrization (PNCP). The PNCP lies on the continuum between the cen- 
tered parametrization (CP) and non-centered parametrization (NCP) and is 
data-dependent ( Papaspiliopoulos el aL|[2003 2007). The PNCP has previously 
been used to accelerate MCMC (Markov Chain Monte Carlo) and EM (expecta- 
tion maximization) algorithms for hierarchical models. In the VB context, Tan| 



and Nott (2012) showed that the PNCP is able to automatically determine a 



parametrization that is close to optimal and improve convergence while result- 
ing in more accurate approximations statistically. In this paper, we derive a 
stochastic version of NCVMP for GLMMs employing the PNCP. 

Recent development in VB methodology have branched out to online infer- 
ence and stochastic optimization so that VB has become a viable approach for 
handling data sets which are too large for standard batch VB algorithms. For 
computational efficiency, VB methods often rely on analytic solutions to inte- 
grals and conjugacy in the posterior. This limits the type of approximations and 
posteriors VB can handle. To overcome this restriction, Salimans and Knowles 
(2012) proposed a SA algorithm that does not require analytic evaluation of inte- 
grals, extending the VB approach to any posterior that is available in closed form 
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up to the proportionality constant. Hierarchical extensions of the basic approach 
allow the method to be made arbitrarily precise. Paisley et al. (2012) proposed a 
stochastic optimization algorithm using control variates that allows direct max- 
imization of the variational lower bound involving intractable integrals. Similar 



algorithms were considered by | Ji et al. (2010) and Nott et al. (2012). Hoffman 



et al. (2010) and Wang et al.\ (2011) developed online VB algorithms for la- 



tent Dirichlet allocation and the hierarchical Dirichlet process respectively using 
stochastic natural gradient optimization of the variational lower bound. [Hoffman 



et al. (2012) generalized these methods to derive stochastic variational inference 



for conjugate-exponential family models and demonstrated that for large data 
sets, stochastic variational inference converges faster than batch VB and to a 
better model. The key idea in stochastic gradient optimization is to use only 
a random subset of the data at each iteration to approximate the true gradi- 
ent over the whole data. This can result in significant reduction in computa- 



tional costs for large data sets. For instance. Bottou and Cun (2005) and Bottou 



and Bousquet (2008) showed that well-designed stochastic gradient algorithms 



outperform batch algorithms in large-scale learning problems. |Welling and Teh 



(2011) combined stochastic gradient optimization with Langevin dynamics for 
Bayesian learning of large-scale data sets and this algorithm was extended to the 
stochastic gradient Fisher scoring algorithm by Ahn et al. (2012 1. 



Amari ( 1998 ) discussed the use of natural gradient in stochastic gradient op- 



timization and pointed out that the steepest direction is given by the natural 
gradient when the parameter space is not Euclidean but has a Riemannian met- 
ric structure. Natural gradient online learning was also shown to be statistically 
efficient. However, the use of the Riemannian structure of the predictive distribu- 
tion in natural gradient learning is often complicated and Honkela et al. ( 2008 ) 
proposed using the Riemannian structure of the factorized variational posterior 
assumed in VB instead. This is very useful for models not in the conjugate- 
exponential family. Honkela et al. ( 2008 ) also showed that replacing the ordinary 
gradient in the conjugate gradient algorithm with the natural gradient can speed 
up variational learning. The VB algorithm was shown to be a type of natural 
gradient method in Sato (20011 and an online VB algorithm with a model se- 
lection mechanism was derived using SA for Gaussian mixture models. In the 
optimization of the variational lower bound, Hoffman et al. (2012) justify that 
the Riemannian metric is more appropriate than Euclidean distance since a large 
change in the parameter can lead to a small change in Kullback-Leibler (KL) 
divergence and vice versa. Therefore, we consider the natural gradient instead 
of the ordinary gradient of the variational lower bound in the stochastic opti- 
mization. For non-conjugate models, we will have to pre-multiply the gradient 
with the Fisher information matrix to obtain the natural gradient, unlike the 
conjugate-exponential case (Hoffman et al. 2012). 

GLMMs extend generalized linear models (GLMs) via the introduction of 
random effects to account for within-subject association and have wide ap- 
plications. Estimation of GLMMs using maximum likelihood is challenging as 
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the integral over the random effects is intractable. Computationally intensive 
methods such as numerical quadrature or MCMC can be used to approximate 
the intractable integrals. Other approximate methods include penalized quasi- 



likelihood ( 


Breslow et al. 


(Raudenbush et al. 


2000) 



1993) and Laplace approximation and its extensions 



Ormerod and Wand (2012) introduced Gaussian vari- 



ational approximation for GLMMs and Fong et al. (2010) illustrated the use of 
integrated nested Laplace approximations for fitting GLMMs using a Baycsian 



approach. SA has also been used in combination with MCMC (Zhu et al. 2002) 



and the EM algorithm (Jank 2006) for fitting GLMMs. In this paper, we focus 
on Poisson and logistic mixed models and their applications in longitudinal data 
analysis (Fitzmaurice et al. 2004). 

Model checking is an important part of statistical analyses. In the Bayesian 
approach, assumptions are made in the form of a sampling model and prior, and 
prior-likelihood conflict arises when the observed data are very unlikely under 
the prior model. A discussion of how to assess whether there is prior-data con- 
flict and whether its effects can be ignored can be found in |Evans and M oshonov 
(2006). Recently, Scheel et al. (2011) proposed a graphical diagnostic, the lo- 
cal critique plot, for identifying influential statistical modelling choices at the 
node level. See also Scheel et al. (2011) for a recent review of other methods 
in Bayesian model criticism. Marshall and Spiegelhalter (2007) proposed a di- 
agnostic test for identifying divergent units (units that do not appear to be 
drawn from assumed underlying distributions) in hierarchical models based on 
measuring the conflict between the likelihood of a parameter and its predictive 
prior given the remaining data. A simulation-based approach was adopted and 
diagnostic tests were carried out using MCMC. We show that the approach of 



Marshall and Spiegelhalter ( 2007 ) can be approximated in the VMP framework 



and that VMP facilitates an automatic computation of diagnostics for prior- 
likelihood conflict. These diagnostics are useful for model criticism and we focus 
on NCVMP for GLMMs. 

The paper is organized as follows. Section 2 specifies the model and describes 
a PNCP for the GLMM. A stochastic version of the NCVMP algorithm is derived 
in Section 3 and Section 4 describes how VMP facilitates automatic computation 
of diagnostics for prior- likelihood conflict. Section 5 considers examples including 
real and simulated data and Section 6 concludes. 



2. The GLMM with a PNCP 

In this paper, we concentrate on one-parameter exponential family models and 
consider a PNCP for the GLMM. Let denote the jth response in cluster i, 
i = l,...,n, j = 1, rij. We assume that conditional on a vector of random 
effects Ui, yij is independently distributed as 

Uij\ui ~ exp{j/ijCij - Kdj) + c(yij)} 
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where £jj is the canonical parameter and b(-) and c(-) are functions specific to 
the exponential family. The link function g relates the conditional mean of yij, 
= E(yij\ui) to the linear predictor, 77^ = X^/3 + ZfjUi as g(nij) — - Here, 
Xij and Z t j are p x 1 and r x 1 vectors of covariates, /3 is a p x 1 vector of unknown 
fixed regression parameters and the r x 1 random effects u^, z = l,...,n are 
independently distributed as N(0, D). We focus on logistic and Poisson models. 
For example, if yij ~ Bernoulli^.,), then b(x) — log{l + exp(a:)}, c(a;) = and 
we let g(^ij) = logit(/xy). For Poisson models, we allow for an offset so that if 



EijXij and log Ay 



r)ij, we let g(^) = log(f^) 



~ Poisson (/iy) where /ijj 
with b(x) — exp(x) and c(x) = — log(x!). 

For the ith cluster, let y l = (y a , y irH ) T , X, = (X? v Xf n ,) T , Z t = 
(Za, ... , Zj n .) T and ?7i = —, r\i ni ) T ■ We assume that the first columns of Xj 
and 2Tj (if non-null) are formed from a vector of ones of length rii , and the columns 
of Zi are a subset of the columns of Xj. We partition Xj as [X/ 2 ln^f X G ] 

with Zi = X R and f3 as (/J^, /3 sT , /3 gT ) t accordingly. Note that /3 5 is a vector 
of length s consisting of all the parameters corresponding to subject specific 
covariates. Then 

Vl = X R (P R + Ui ) + l ni xfp s + Xf/3 G 



Xf(CiP Hb + m) + XY^ where C, 







(r-l)xs 



and j3 RS = (/3 R , P s ) T . We introduce a, = Ci/3 RS '+m and 5, = a,- WiC^ RS , 
where Wi is an r x r matrix controlling the degree of centering. Wi = corre- 
sponds to the CP while Wi = I corresponds to the NCP. The PNCP is 



T)i = Cj/3 + X 



where C t = [Xp WjCj Xf], Le t W, 
N{WiP,D). From 



Tan and Nott 
IF 

-ij 



[(I r - Wi)Ci rX ( p _ r )] so that &i ~ 
(|2012|>, we set W t = (I f + D~ 1 )~ 1 D~ 1 where 



If = E"li VijX R X Rl for Poisson models and I f = ^JU {1 ^^ X R xf 
for logistic models. We approximate D and rjij by their posterior means in the 
actual implementation. 

For Bayesian inference, we specify a diffuse prior, N(0, E^), on /3 where E^ is 
large and an independ ent inverse Wishart prior, IW(v, S) on D. Following Kass 
and Natarajan (20061, we set v 



r and S = rR with 



R = c- 



I " 

-Y^ZfMiC^Zi 

II z ' 



(1) 



where Mj (j3) denotes the usual n.j x n.j diagonal GLM weight matrix with diagonal 
elements {v(flij) </(Ay) } > is the variance function and g(-) is the link 
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function. Here, jlij = g {Xj^fi + Zf^Ui) where Ui is set as for i — l,...,n 

and j3 is an estimate of the regression coefficients from the GLM obtained by 
pooling all the data and setting u\ = for all i. The value of c is an inflation 
factor representing how much within-cluster variability should be increased in 
determining R. We used c = 1 for the examples in this paper. 



3. Stochastic variational inference for GLMMs 



In the GLMM, the fixed effects f3 and the random effects covariance matrix D 
can be regarded as "global" variables which are common across clusters while the 
partially centered random effects &j, i = l,...,n, can be thought of as "local" 
variables associated only with the individual units. Tan and Nott (2012) pro- 



posed fitting GLMMs with a NCVMP algorithm that iterates between updating 
variational parameters of the local variables and re-estimating variational param- 
eters of the global variables. For large data sets, this procedure is inefficient as 
every unit in the data set has to be analysed before variational parameters of the 



global variables can be updated each time. Hoffman et al. (2012) suggested using 



SA methods (Robbins and Monro 1951 ) to derive more efficient algorithms. The 



key idea is to use stochastic gradients computed on mini-batches to approximate 
the true gradient over the whole dataset. They considered conjugate-exponential 
family models with applications on latent Dirichlet allocation and the hierarchi- 
cal Dirichlet process. We extend SA methods to non-conjugate models, focusing 
on the NCVMP algorithm for GLMMs. 



3. 1 . Natural gradient of the variational lower bound 



The stochastic gradient form of SA (see Spall 2003 ) aims to optimize an objective 



function based on noisy but unbiased estimates of its gradients. In variational 
approximation (VA), we wish to minimize the KL divergence between the true 
posterior and the approximating variational distribution which is equivalent to 
maximizing the variational lower bound. Following Hoffman et al. (2012), we 



use the natural gradient of the lower bound instead of the usual gradient in the 
SA. In this section, we review VA briefly and derive the natural gradient of the 
variational lower bound in the non-conjugate case. 

Let y denote the observed data and 9 the unknown model parameters. In 
variational inference (Jordan et al. 1999[ ), we approximate the true posterior 
p(0\y) by a more tractable distribution q(9) which is optimized to be close to 
p{0\y) in terms of the KL divergence. The KL divergence between q(9) and p(9\y) , 
KL(q\\p), can be written as 



q{9) log 



m 

P(0\y) 



cl9 = / q(9) log 



g(0) 

p(y,0) 



d9 + log p(y) 
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where p{y) = J p(y\9)p(6) d9 is the marginal likelihood. As KL(q\\p) > 0, 

logp(y)> J q{9)\og P -^d9 

= E q {logp(y,6)}-E q {logq(e)} 
= £ 

which gives a lower bound C on the log marginal likelihood. 

In NCVMP, the VA q{9) is of a factorized form YiiLi Qi{@i) f° r some partition 
...,9m} of 9, and each q{ belongs to some exponential family. Let 

qiiOi) = exp{Af t{9i) - h(\i)} 

where Xi is the vector of natural parameters and t(-) are the sufficient statistics. 
To minimize the KL divergence between q and the true posterior p(9\y), we max- 
imise the variational lower bound C with respect to the variational parameters 
Ai,...,A m which are the natural parameters of qi {9\ ),..., q m {9 m ) respectively. 
The gradient of C with respect to A^, i = 1, m, is given by 

f£ = £-E q {logp(y, 9)} - gf^log q(9)}. (2) 

To evaluate the first term in (pi), suppose that p{y,9) = JJ a f a (y,9) so that 
E q {\ogp(y, 9)} — J2 a fa(y, #)}• Since we have assumed in the VA that 9i 

is independent of all 9j where j ^ i, the only terms in ^q{^°S fa(y, #)} which 
depend on A, come from those factors /„ connected to 9i in the factor graph of 
p(y,9) (see |Bishop| ( [2006| and Fig. [I). Therefore, 



JLE q {logp(y,6)}= ( 3 ) 

aeN(0i) 

where the summation is over all factors in N(0i), the neighbourhood of 9t in the 
factor graph and S a {Xi) — J qi(8i\Xi)E-g i {log f a (y, 9)} d9i where E-o t denotes 
expectation with respect to the density Yljjtilji^j)- Note that S a {Xi) also de- 
pends on variational parameters of other variables connected to f a in the factor 
graph of p(y, 9). For the second term in E q {logq(8)} = J27Li E q {\ogq % {9i)} 
of which only the ith term depends on A^. Hence, 

^E q {log q (9)} = ^{xJ^-h(X i )} 

= V(Xi)Xi, (4) 

where we have used the fact that E q {t(9i)} — - and V(X,i) = q^^t denotes 
the variance-covariance matrix of t{9i). Putting Q and Q together, the gradient 
of the lower bound is 

£&= E 

a£N(6i) 
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To obtain the natural gradient of the variational lower bound, we multiply 



the inverse of the Riemannian metric with the gradient of £ ( Honkela et al. . 2008 
For the variational parameter A^ of q. 



Hoffman et al. 



metric is E q { 



2012 



/i\ 5x7 ~( d\, > /• u "^ c ax. 
Riemannian metric is simply V(Xi). The natural gradient V£(Ai) is thus 



) T }. Since 



a log gifgiiAj) _ 



i), the Riemannian 



V£(A,) = v(\ t y 



where L(Xi) denotes the lower bound considered as a function of the subset of 
the parameters, A i? with the other values held fixed. The natural gradient is zero 



when A, = ^(A.r 1 E aS 



as n (A, 



. In NCVMP, this optimality condition is 



j£JV(eo aAi 

used to obtain updates to A^ in a fixed-point iterations algorithm where 



X? = V (Af " 



dS a (\i) 



at the tth. iteration for i = 1, m. The lower bound is not guaranteed to increase 
at every step in NCVMP and sometimes convergence issues may be encountered 



which may require damping to fix (see Knowles and Minka 2011) 



When the factor f a is conjugate to qi(0i), that is, f a has the same func- 
tional form as qi{9{) with respect to (9^, the gradient of the lower bound can be 
simplified. Suppose 

f a (y, 9) = exp{g a (y, e^) T t{di) - h a (y, Ml 



where B—i = 



9 m ). Then 



E q {\ogf a (y,9)} = V{\i)E q {g a (y,B-i)}, 



where E q {g a (y,9_i)} does not depend on A^. When this holds for each fac- 
tor in the neighbourhood of 9i in the factor graph, the gradient of the lower 

bound is V(Xi) J2a£N{0i) E q {g a {y 1 — Aj and the natural gradient is sim- 
ply V£(A,) = J2aeN(8 z ) E q {g a {y,6-i)} - A*. The updates in the NCVMP algo- 
rithm reduce to 



i(t+i) 



Y, E q {g a (y,6-i)}\ x _ t= 

a£N{9i) 



where A_, = (Ai, A,_i, A i+ i, A m ) and A^] = (a[*\ \y} v a|^ x , A$). 
This is the same as updates in the VMP algorithm. NCVMP thus reduces to 
VMP for conjugate factors (Knowles and Minka 2011[ ). 
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\i=\, 



Fig. 1. Factor graph for GLMM model. Dotted box indicates that the contained node and 
its edges are duplicated n times. 



3.2. Stochastic variational algorithm for the GLMM 

For the GLMM, 9 = (a, /3, D) where a = (a~i T , 



) and 



P(l/,*) = J \{p(y l \^a l )p{^\P,D) \p[fi\^ p )p{p\v, S). 



This factorized distribution can be expressed as the factor graph in Fig. [T] We 
consider a VA to the posterior of the form 



q(e) = q^)q(D)l[q(a i ) 



i=l 

where q(0) is 2V(a$,5$), g(U) is IW(z/*, and g(a 4 ) is JV(/i| 4> E^). Let 
A/3, Ac and A^^ denote the natural parameter vectors of q(P), q(D) and q(&i) 
respectively for i — 1, ...,n. In the standard NCVMP algorithm, we would it- 
erate between updating the local variational parameters for each unit i, 
i = 1, n, and updating the global variational parameters and Xjj. This can 
be impractical for large data sets and impossible to accomplish for streaming 
data. Instead, we randomly select a mini-batch, S, of units (with or without re- 
placement), say of size \S\ > 1 at each iteration and compute NCVMP updates 
for Xa { , i € S repeatedly until convergence. Using these optimized local varia- 
tional parameters, we can compute unbiased estimates of the natural gradients 
V£(A^) and VC(Xd) and estimate A^ and Xd using SA. Similar approaches to 
online variational inference have been developed for latent Dirichlet allocation 



by Hoffman et dl. (2010) and for the hierarchical Dirichlet process by Wang et 



al. (2011). Hoffman et al. (2012) developed stochastic variational inference for 



conjugate-exponential family models and showed that this type of algorithm is a 
form of stochastic natural gradient ascent of the global variational parameters. 
The idea is to find a setting of the global variational parameters that maximizes 



the lower bound by considering the variational lower bound as a function of the 
global variational parameters with the local parameters optimized as a function 
of these global parameters. We have 



VC(X P ) = V{\ ) 



where Ai = E q {logp(yi\(3 , dti)} + E q {logp(ai\j3,D)}. An unbiased estimate of 
V£(A^) using the mini-batch S is VC(Xp) = Xp — Xp where 



~Xp = V(X P ) 



dX, 



E 9 {\ogp(p\Ep)} 



\S\ 



5> 

ies 



As the factors in the neighbourhood of D are all conjugate factors, 



VL(X D ) — 'V+r+l 



where B; = vec 



estimate of V£(Ad) using mini-batch S is V£(Ao) = A^ — Xjj where 

■|vec(5) 



An unbiased 



\s\ 



E 



■\B, 



When S is the entire data set, A^ and Xd are the updates of Xp and Ad in the 
standard NCVMP algorithm. Readers may refer to Tan and Nott (2012) where 
this variational posterior was considered, for the NCVMP update expressions 
from which the natural gradients may be inferred. The natural gradients are 
available in closed form for the Poisson GLMM, while adaptive Gauss-Hermite 



quadrature ( Liu and Pierce 1994 ) is required for evaluation of the natural gra- 



dients in the logistic case. In this paper, we focus on presenting a stochastic 
variational algorithm for the GLMM suitable for analyzing moderate to large 
datasets or even data arriving continuously in an infinite data stream. 

In the stochastic NCVMP algorithm, we combine NCVMP updates for the 
local variational parameters A^, i £ S with SA estimation of the global param- 
eters A^ and as outlined in Tab le [T| T he SA steps in lines 10 and 11 were 
introduced by Robbins and Monro ( 1951 1 for optimizing an objective function 



which in our case is the lower bound C, with local variational parameters op- 



timized as a function of the global ones. Hoffman et al. (2012) note that the 



gradient of this function is just the gradient of C with the local parameters fixed 
at their optimized values (see 



tain regularity conditions (see 



Hoffman et al. 2012 equation (39)). Under cer- 



Spall 2003), the iterates will converge to a local 
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Table 1. Stochastic NCVMP algorithm for GLMMs 



1: Initialize variational parameters \p, Xd, X& i for i = 1, ...,n 

2: and the partial centering parameters Wi for i — 1, ...,n. 

3: At the tth iteration, 

4: randomly select a set S of \S\ units. 

5: Update local variational parameters Aa ; for i 6 S 

6: repeatedly using NCVMP update: 

7: Ag» 4- V (a^ 1 ') ~* jA- [£ q {logp( yi |/3, a,)} + ^{logp^S/3, D)}] 

8: until convergence. 

9: Next, update the global variational parameters A/3 and Ad using 

10: X^ = A^ 1 ' + a t X7C (a^ 1 ') and 

11: Ag^Ag-^+otV/^Ag- 1 '). 

12: Repeat steps 3 to 11 until convergence. 



maximum of the lower bound. In particular, the sequence at, t > should satisfy 



at — > 0, a* = co, and < oo 

t 



(5) 



Intuitively, the condition (at — > 0,J2t a t < °°) ensures that the step size goes 
to zero sufficiently fast so that the iterates will converge while (J^t a t = 00 ) 
ensures that the rate at which the step sizes approach zero is slow enough to 



avoid false convergence (|Spall[ 2003|. We consider step sizes of the form ^jrW? 



where 0.5 < 7 < 1 and K > (J is a stability constant that helps to avoid unstable 
behaviour in the early iterations. In principle, the iterates will converge to a 
local optimum as long as conditions in ^ are satisfied. However, choices of the 
step s izes can strongly influence performance of the algorithm in practice ( Jank 
2006). As V£(A^j) = — \p, we have Ajg = (1 — a t )A^ 1 ^ + a t Xp from line 
10. Thus the t-iterate can be interpreted as a weighted average of the previous 
iterate and the NCVMP update of A/3 estimated using mini-batch S. The same 
holds for Xd- Standard NCVMP is recovered from stochastic NCVMP when the 
update for the local parameters in line 7 is performed only once and 7 = in 
the SA steps in lines 10 and 11. 

To initialize the variational parameters, we use the GLM obtained by pooling 
all the data and setting the random effects Ui — for all i. This GLM was used 
to compute the prior of D, IW(v, S). We set and as the estimate of the 
regression coefficients and its covariance matrix from the GLM respectively, v q = 



r, S« = S, fA = 



WifA and S|. = R where Wi is initialized by setting D = R 



and T}i by C^/il + X^iA & . for i — 1, n. Kass and Natarajan (2006) contain 
a justification of R being a reasonable guess for D in the absence of any other 
prior knowledge. Care should be taken in initializing the variational parameters 
as the NCVMP update steps in line 7 are not guaranteed to converge. We used 
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the initialization suggested above in all our examples and did not experience any 
convergence issues. The mean parameters of on were used to test for convergence 
in line 8 and we terminate the fixed-point iterations in line 7 when 



..« (*) ..Q 


(t-1) 














"<5i 





< 0.01, 



where |j • || represents the Euclidean norm. 



3.3. Switching from stochastic to batch VA 

Determining an appropriate stopping criterion for a SA algorithm can be a very 
challenging task. Some commonly used stopping criteria include stopping when 
the relative change in parameter values or objective function is sufficiently small 



or when the gradient of the objective function is sufficiently close to zero (Spall 



2003). Such criteria do not provide any guarantees of the terminal iterate being 



close to the optimum however and may be satisfied by random chance. |Booth| 



et dl. ( 1999 ) recommend applying such rules for several consecutive iterations 



to minimize chances of a premature stop. However, Jank (2006) gives an illus- 
trative example to show that even this is not enough safeguard. Moreover, the 
performance of SA can become excruciatingly slow in the later iterations due to 
the small step sizes. 

Through our experimentations, we observe that gains made by stochastic 
NCVMP are usually largest in the first few iterations. However, beyond a certain 
point, it can become even slower than standard NCVMP if the step sizes are too 
small or the iterates simply bounce around if the step sizes are still too big. 
We therefore suggest switching from stochastic to standard NCVMP when the 
stochastic NCVMP shows signs of slowing down. Using the lower bound both 
as a switching and stopping criterion, we propose switching from stochastic to 
standard NCVMP when the relative increase in the lower bound is less than 
10 -3 and terminating the standard NCVMP when the absolute relative change 
in the lower bound is less than 10~ 6 . This proposal may be more applicable 
to moderate-sized data sets such as those considered in this paper. For large 
datasets or streaming data, it might be more practical to terminate the algorithm 
beyond a certain period of available runtime. 

The mini-batches in line 4 of the stochastic NCVMP algorithm were chosen 
by random-partitioning of the data set. In this paper, the mini-batch sizes con- 
sidered were such that different batches differ in size by at most one when n 
is not divisible by \S\. For greater efficiency the lower bound is computed only 
after a complete sweep has been made through the data set. We replace t by 
s w + in the step size where s w indicates the number of sweeps that has been 
made through the data, M denotes the number of partitions of the data and 
< to < M — 1 denotes the number of batches that has been analysed. It is pos- 
sible to include an update of Wi, i = 1, .., .n after each complete sweep. However, 
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preliminary investigation suggested no significant improvement in results when 
Wi is updated and hence, for the examples in this paper, we did not update Wi 
beyond the initialization. 



Automatic diagnostics of prior-likelihood conflict as a by-product of 
VMP 



Marshall and Spiegelhalter ( 2007 ) considered hierarchical models where observa- 



tions are grouped within units and parameters of these units are assumed to be 
drawn from some population model. They investigated a diagnostic test for iden- 
tifying divergent units based on measuring the conflict between likelihood of a 
parameter and its predictive prior given the remaining data. A simulation-based 
approach was adopted and diagnostic tests were performed using MCMC. As 
full cross-validation methods are computationally demanding, they also consid- 
ered full data approximations which were shown to demonstrate only moderate 



conservatism. We show that the approach of Marshall and Spiegelhalter ( 2007 ) 



can be approximated in the VMP framework and that VMP facilitates an au- 
tomatic computation of diagnostics for prior- likelihood conflict. Intuitively, the 
idea is that we can separate the messages coming from above and below a node 
in a hierarchical model, and when there are "mixed messages" this is indicative 
of conflict. These diagnostics are very useful for model criticism. We focus on 
NCVMP for GLMMs. 



First, we review briefly the diagnostic test proposed by Marshall and Spiegel 



halter (20071. In the context of GLMMs with a PNCP, the parameter of interest 
for i dentif ying divergent units is &i, i = 1, n. For c^, [Marshall and Spiege lhah] 



ter 



(2007) suggest generating a predictive prior replicate aj ep ~ p(&i\y-i) where 



y_j denotes the observed data y with unit i left out and 



p(oi\y-i) = / p(ai\P,D)p(l3,D\y^) df3dD. 



(6) 



In the simulation approach, (3 Icp , D lcp would be generated fromp(/3, D\y_i) using 
MCMC followed by simulation of a™ 5p |/3 rop , D lop . This is compared with a like- 
lihood replicate af x ~ p(oa\Vi) generated using only data from the unit yi being 
tested and a non-informative prior, p{on), for an since p(&i\yi) oc p(yi\cti)p(oti). 
These prior and likelihood replications represent two independent sources of ev- 
idence about &i and conflict between them suggests discrepancies in the model. 

The above discussion ignores nuisance parameters. In our case, we need to 
regard j3 as a nuisance parameter. As p(&i\yj) oc p(&i) J p{yi\P, ai)p((3\dti) dj5 and 
P is not estimable from individual unit i, Marshall and Spiegelhalter] ( 2007| ) [pg. 



420] recommend generating af x from f(cei\y) where 



f{<*i\y) <xp(6ti) / p(yi\&i,l3)p(P\y-i)d0. 
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Note that the two replications a Y ° p and fif x are no longer entirely independent 
as y-i will slightly influence af* through (3. 

To compare the prior and likelihood replicates, Marshall and Spiegelhalter] 



( 2007 ) considered af lS = aj ep — af x and calculated a conflict p- value 

pton = P(&f s < Q\v) 

as the proportion of times simulated values of af lff are less than or equal to zero 
for scalar 5j. Depending on the context, the upper tail area pf con = 1 — pf con 
or the 2-sided p- value 2 x min(p^ con ,p^ con ) may be of interest instead. If ctf lS is 
not a scalar, E(a^ lff \y) T Cov(af lS \y)~ 1 'E(af iS \y) can be used as a standardised 
discrepancy measure. An alternative to this cross- validatory approach is to sim- 
ulate aj ep |/3 rcp , D Icp using /3 rcp , D lcp generated from p(0,D\y) without leaving 
out yi. This introduces only mild conservatism in normal hierarchical models as 



y, influences <Sj ep through j3 and D (Marshall and Spiegelhalter 2007) 



We attempt to approximate the approach of |Marshall and Spiegelhalter] 



(2007) in the VMP framework. Recall that the variational posterior for a, is 



ivec(S? - 1 ) 

Gaussian with natural parameter Aa, = , . The NCVMP update 



for \&. is given by 



^(Aa < )- 1 5£r[^{logp(a 4 |j8,D)} + ^,{Iogp( W |a < ,)9)}] 



-^-vec(S q 



J V ^IikWik / 



where we let ViX^ )' 1 j^-EJlogply^, P)\ = ( fe^^Y The first term 

can be considered a message from the prior p(a.i\f3,D) and the second term 
a message from the likelihood of unit yi, p(yi\cti, (3). We argue that the first 
message from the prior can be interpreted as natural parameter of a Gaussian 
approximation to p(cti\y-i) while the second message from the likelihood can 
be interpreted as natural parameter of a Gaussian approximation to f(ai\y). 
This implies that aj cp - N(Wifi q , ^S q ) and af* ~ JV(Mlik, S uk ) so that af m ~ 
N(Wifi q p — /link, -^S q + Snk) assuming aj cp and af x are considered independent. 
Since these messages are computed in the NCVMP algorithm, conflict p-values 
can be calculated easily at convergence for identification of divergent units. 

For moderate to large data sets, the difference between p(/3, D\y_j) and 
p(/3,D\y) is small and we approximate p(j3, D\y_j) in Q by the variational 
posterior q(f3)q(D). This combined with Jensen's inequality gives 

logp(a 4 |y_i) w logE_ &i {p(oi\0,D)} 
>E^{logp(a t \^D)}. 

Approximating p(aj|j/_j) by exp[£ , _Q.{logp(ai|/3, D)}], we then have aj cp ~ 
N(WiHp 7 ~tqS q )- On the other hand, the total message gives us the natural pa- 
rameter of q{5ii) which is an approximation of p(cti\y). If we think of p{on\y-i) 
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as the 'prior' to be updated when yi becomes available, we have 
p(&i\y) oc p(ai\y-i)p(yi\ai,y-i) 

which implies that 

P{oii\y) r I ~ \ 
°zp(yi\ai,y-i). 



p(&i\y- 

Thus, interpreting the first message as natural parameter of a Gaussian approx- 
imation to p{a.i\y-i) and the sum of the two messages as natural parameter of a 
Gaussian approximation to p(&i\y), the ratio of these two normal distributions 
gives an approximation (up to a proportionality constant) of p(yi\&i, y~i). As a 
function of oti , the ratio of the two normal distributions is proportional to 

exp{-|(a,- M | i rS| < - 1 (a i - M | i )} 
cxp{-i(5 4 - Wi^viSi-^&i - Winl)} 

which gives a normal distribution with natural parameters 
precisely that given by the second message. As 



p(yi\ai,V-i) = J p(yi\0,&i)p(/3\ai,y-i)d/3 

and p(/3\&i,y-i) is close to p{j3\y-i) when the number of clusters is large, the 
second message can be considered as giving the natural parameter of a Gaus- 
sian approximation to /(<Si|y) if we assume a uniform prior for p(&i). Finally, 
even though aj ep and af x are not entirely independent, for large data sets, the 
dependence between <S- cp and af x will be increasingly weak as the number of 
clusters increases so that af lS ~ N(Wip,p — ^nk, ~^S q + Enk) approximately. 

This approximate distribution for af lS can be used to calculate conflict p- 
values when the NCVMP algorithm has converged. We illustrate this method 
with an example in Section 5.1 and the results show very good agreement between 



the cross- validatory conflict p- values computed using the approach of Marshall 



and Spiegelhalter (20071 and those obtained from NCVMP. For large data sets, 
automatic computation of diagnostics for prior-likelihood conflict can be an at- 
tractive alternative to the simulation-based approach using MCMC methods. 
While the approximations made in our derivation of the diagnostic are crude, 
the diagnostics can be computed automatically as part of the NCVMP algorithm 
and can be regarded as a handy screening tool with the clusters flagged as diver- 
gent studied more closely and possibly conflict p-values recomputed by Monte 
Carlo. The arguments above generalize to detecting conflict for other parameters 
of the model also. 
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5. Examples 



In Section 5.1, we use the Bristol infirmary inquiry data to compare the conflict p- 
values computed using the NCVMP algorithm with those obtained using the ap- 
proach of Marshall and Spiegelhalter (2007). In Sections 5.2 and 5.3, we apply the 
stochastic version of NCVMP to a real data set and a simulated data set respec- 
tively. In all the examples, the PNCP was used and we considered a -/V(0, 1000) 



prior for f3 and a inverse Wishart default conjugate prior ( |Kass and Natarajan 



200G) for D. We also experimented with various settings of A' and 7. The Musca- 



tine coronary risk factor study data set and the skin cancer prevention study data 
set can be found at http://www.biostat.harvard.edu/~fitzmaur/ala2e/ 



5. 1 . Bristol infirmary inquiry data 

In 1998, a public inquiry was set up to look into the management of children 
receiving complex cardiac surgical services at the Bristol Royal Infirmary from 
1984 to 1995. The outcomes of paediatric cardiac surgical services at Bristol 
relative to other specialist centres was a key issue. We consider a subset of the 
data presented to the Inquiry recorded by Hospital Episode Statistics on the 
mortality rates in open surgeries for 12 hospitals including Bristol (hospital 1), 
for children under 1 year old, from 1991-March 1995. This data can be found 
in |Marshall and Spiegelhalter| (|2007[) Table 1. |Spiegelhalter et d.\ (|2002[) and 



Marshall and Spiegelhalter (20071 modelled this data using a logistic GLMM. 



Although the number of clusters is small in this example whereas our method- 
ology is motivated by applications to large data sets, this example is interesting 
as a benchmark data set in the literature for calculating prior-likelihood conflict 
diagnostics from the NCVMP algorithm. 

Let Yi — Dij represent the number of deaths at hospital i, i = 1, 12. 



We have yij ~ Bernoulli^) where 
otherwise. Let 

logit(7Ti) = j3 + v 



j = 1 if patient j at hospital i died and 
where Ui~N(0,D). 



To assess the accuracy of the approximate conflict p-values obtained from the 
standard NCVMP algorithm, we use the cross-validatory conflict p-values ob- 



tained using the simulation-based approach of Marshall and Spiegelhalter ( 2007 ) 



as a 'gold-standard' and compute these for comparison. In the cross-validatory 
approach, each hospital i is removed in turn from the analysis, and the pa- 
rameters (3 rop , D Tep \y_i are generated using MCMC followed by a simulated 
7r^ Gp |/? rep , D rep . Assuming a Jeffrey's prior for iTi, a nf x is then simulated from 
Beta(Vi + 0.5, Uj — Yi + 0.5). Excess mortality is of concern and the upper-tail 
area is used as a 1-sided p- value so that Pi^on = -P( 7r i° P — ^t*)- 100 000 sim- 
ulations were used in calculating the cross-validatory conflict p-values. Fitting 



via MCMC was performed in WinBUGS ( Lunn et al. . 2000 1 through R by using 



R2WinBUGS (Sturtz et al. 2005) as an interface. The CPU time taken to com- 



pute the cross-validatory conflict p-values using the simulation-based approach 



1G 



Table 2. Cross-validatory 
conflict p-values (p^ y con ) and 
approximate conflict p-values 
from NCVMP (pS mp ). 



ri aotm i~ a 1 


n,con 


NCVMP 

Pi, con 


1 
1 


n 001 


004 

U.UU4; 


2 


0.434 


0.448 


3 


0.935 


0.934 


4 


0.125 


0.131 


5 


0.298 


0.305 


6 


0.721 


0.730 


7 


0.740 


0.751 


8 


0.660 


0.672 


9 


0.438 


0.450 


10 


0.377 


0.386 


11 


0.766 


0.771 


12 


0.723 


0.733 



was 465 seconds. 

The cross-validatory conflict p- values (pf^ on ) and approximate conflict p- 
values from NCVMP (p^ c c D n MP ) for all hospitals are shown in Table [2] These two 
sets of p- values are plotted in Fig. [2] which indicates very good agreement between 
the two. To reflect the importance of good agreement at the extremes, Marshall 
and Spiegelhalter ( 2007 ) computed the relative agreement between p- values as 



$- 1 ( P ; 



cv 

/.con 



-1C„NCVMP\ 



x 100% 



where $~ 1 denotes the inverse cumulative distribution function of the standard 
normal. The relative error between p^f on and pf ^,„ MP is 9% which is close to the 
relative error of 7% between cross-validatory and full data conflict p-values re- 



ported in Marshall and Spiegelhalter (2007). In this case, the NCVMP algorithm 



yields very good results and takes only 7 seconds. For moderate to large data sets, 
the VMP approach will be an extremely attractive alternative to computation- 
ally intensive MCMC methods for obtaining prior-likelihood conflict diagnostics. 



5.2. Muscatine coronary risk factor study 

A total of 485G children took part in the Muscatine coronary risk factor study 
(Woolson and Clarke 1984) which was undertaken to examine the development 
and persistence of risk factors for coronary disease in children. Over the period 
1977-1981, weight and height data were collected biennially from five cohorts of 
children, aged 5-7, 7-9, 9-11, 11-13 and 13-15 at the beginning of the study. 
The data is incomplete with less than 40% of the children surveyed on all three 
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Fig. 2. Plot of cross-validatory conflict p-values against approximate conflict p-values 
from NCVMP. 



occasions. In previous analyses, some authors treated this data as potentially 



missing not at random (for instance, Zhou et al. (2010)) while others assumed 



the data are missing at random (MAR) (Fitzmaurice et al. 1994 Kenward and 



Molenberghs 19981). We assume the data are MAR and focus on computational 



comparisons between standard and stochastic NCVMP. The binary response, 
Uij, is an indicator of whether the ith child is obese at the jth occasion. For the 
ith child, we consider the covariates, gender i = 1 if female, if male and age^ = 
midpoint of age cohort at jth occasion —12. Fitzmaurice et al. (2004) modelled 



the marginal probability of obesity as a logistic function of gender and linear 
and quadratic age. We consider the following logistic random intercept model, 



logit(^) = f3 + /3i gender, + ft-age^- + /3 3 age 



where u, ~ N(0,a 2 ) for % = 1,...,4856, 1 < j < 3. The standard NCVMP 
algorithm took 823 seconds to converge for this moderately large data set. 
The performance of stochastic NCVMP was investigated using different mini- 
batch sizes and various parameter settings for the step sizes. We considered 
\S\ € {1,50,99,242} where the mini-batch sizes were chosen to correspond to 
the online setting and approximately 1%, 2% and 5% of n = 4856. We let the 
stability constant K take values 0, 1 and 5 and 7 be 0.5, 0.75 or 1. In the online 
setting, the stochastic VA diverges for each K e {0, 1, 5} and 7 € {0.5, 0.75, 1}. 
We thus considered larger stability constants K e {250,500, 1000} for IS*! = 1. 
For each mini-batch size and parameter setting for the step-size, we did five runs 
of the stochastic NCVMP switching to standard NCVMP each time the relative 
increment in the lower bound after a complete sweep through the data is less 
than 10~ 3 . The average time taken for the algorithm to converge in each case is 
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Mini-batch size = 1 Mini-batch size = 50 Mini-batch size = 99 Mini-batch size = 242 




400 800 012345 012345 012345 

K K K K 



Fig. 3. Coronary risk factor study. Plot of average time to convergence against the sta- 
bility constant K for different mini-batch sizes. The solid, dashed and dot-dashed lines 
correspond to 7 = 1 , 0.75 and 0.5 respectively. 



Table 3. Coronary risk factor study. 
Best parameter settings and aver- 
age time to convergence (in sec- 
onds) for different mini-batch sizes. 



\s\ 


1 


50 


99 


242 


K 


250 


1 








7 


1 


1 


0.75 


0.5 


time 


426 


310 


271 


346 



shown in Fig. [3j The solid lines, dashed lines and dot-dashed lines correspond to 
7 = 1, 0.75 and 0.5 respectively. The best parameter settings and average time 
to convergence for each mini-batch size are summarized in Table [3j From these 
results, we observed that as the mini-batch size increases, smaller values of 7 
and K, that is, a slower rate of decrease in step-size and larger step-sizes lead to 
faster convergence. However, a significantly larger stability constant and smaller 
step sizes are required in the online setting to prevent unstable behaviour in the 
early iterations. The mini-batch size of 50 (approximately 1% of n) performed 
well across a wide range of step-sizes with the average time to convergence rang- 
ing from 310 to 413 seconds. The shortest average time to convergence is 271 
seconds for the mini-batch of size 99 with K = and 7 = 0.75. This implies 
a reduction in convergence time of around 67% from standard NCVMP. Fig. [4] 
tracks the average lower bound attained at the end of each sweep through the 
data for the different batch sizes corresponding to the best parameter settings 
listed in Table [3] Only the first ten sweeps are shown. This figure shows that 
with appropriately chosen step-sizes, stochastic NCVMP is able to make much 
bigger gains than standard NCVMP particularly in the first few sweeps. Thus, 
even for moderate-sized data sets, significant gains can be made by making use 
of stochastic NCVMP. 
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Fig. 4. Plot of average lower bound against number of sweeps through entire data set 
for different batch sizes under the best parameter settings. 



5.3. Skin cancer prevention study 

In a clinical trial conducted to test the effectiveness of beta-carotene in prevent- 
ing non-melanoma skin cancer (Greenberg et al. 1989[), 1805 high risk patients 



were randomly assigned to receive either a placebo or 50 mg of beta-carotene per 
day for five years. Subjects were biopsied once a year to ascertain the number of 
new skin cancers since the last examination. The response is a count of the 
number of new skin cancers in year j for the ith subject. Covariate information 
for the ith subject include age i; the age in years at the beginning of the study, 
gender.^ = 1 if male and if female, exposure^, a count of the number of previ- 
ous skin cancers, skin^ = 1 if skin has burns and otherwise, treatment^ = 1 if 
the ith subject receives beta-carotene and if placebo and year i:) -, the year of 
follow-up. We consider n = 1683 subjects with complete covariate information. 
Using conditional Akaike information to perform model selection, |Donohue e1\ 



al. (2011 1 fitted different Poisson GLMMs to this data and arrived at the model 



log(nij) = (3 + f3i&ge { + /3 2 skini + f3 3 gendei i + /^exposure, + u. t , 

where m ~ N(0, a 2 ) for i = 1, 1683, 1 < j < 5. The treatment and year effects 
did not prove to be significant in their analyses. Using this model, we investigate 
the performance of standard and stochastic NCVMP algorithms. As this data 
set is small, preliminary investigation shows that the time to convergence of the 
standard and stochastic NCVMP algorithms are close and stochastic NCVMP 
did not provide any significant gains over standard NCVMP. We thus simulated 
a data set comprising of n = 1683 x 6 = 10098 subjects by using the posterior 
means of the unknown parameters from the standard NCVMP fit to the original 
data set. Thus, we replicate the design matrices for each cluster 6 times. For this 
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Table 4. Skin cancer study. Best 
parameter settings and average 
time to convergence (in seconds) 
for different mini-batch sizes 



\s\ 


1 


100 


198 


504 


K 


1000 


1 


1 





7 


1 


1 


1 


1 


time 


278 


233 


219 


216 



Mini-batch size = 1 Mini-batch size = 100 Mini-batch size = 198 Mini-batch size = 504 




400 800 012345 012345 012345 

K K K K 



Fig. 5. Skin cancer study. Plot of average time to convergence against the stability con- 
stant K for different mini-batch sizes. The solid, dashed and dot-dashed lines correspond 
to 7 = 1,0.75 and 0.5 respectively. 

simulated data, standard NCVMP took 415 seconds to converge. 

We considered mini-batch sizes corresponding to the online setting and ap- 
proximately 1%, 2% and 5% of n = 10098, that is, |5| G {1, 100, 198,504}. We 
let 7 be 0.5, 0.75 or 1 and the stability constant K take values 0, 1 and 5 for 
|5| G {100, 198,504} and values 250, 500, 1000 for \S\ = 1. For each mini-batch 
size and parameter setting for the step-size, we did five runs of the stochastic 
NCVMP, switching to standard NCVMP each time the relative increment in the 
lower bound after a complete sweep through the data is less than 10 -3 . The 
average time taken for the algorithm to converge in each case is shown in Fig. [5] 
The solid lines, dashed lines and dot-dashed lines correspond to 7 = 1, 0.75 and 
0.5 respectively. The best parameter settings and average time to convergence 
for each mini-batch size are summarized in Table |4j 

In the online setting, stochastic NCVMP diverges for each 7 G {0.5,0.75, 1} 
when K = 250. As in Example 5.2, larger stability constants are preferred when 
I SI = 1. For this simulated data, a higher rate of decrease in step-size is desirable 
with 7 = 1 yielding the best performance across different mini-batch sizes. Larger 
batch sizes also seem to lead to faster convergence. Fig. [6] compares the rate 
of convergence of standard and stochastic NCVMP for one of the runs where 
\S\ = 504, K = and 7 = 1. The variational lower bound £ is -23617.3 
at convergence and we have plotted log(— 23617 — £) against time. Stochastic 
NCVMP took just 7 sweeps to converge in 216 seconds while standard NCVMP 
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100 200 300 400 
time in seconds 



Fig. 6. Plot of log(-23617 - C) against time for the mini-batch of size 504, K = and 
7=1. 



took 22 sweeps and converged in 415 seconds. This represents a reduction of 
about 50% in convergence time. 



6. Conclusion 

In this paper, we have extended stochastic variational inference to non-conjugate 
models and derived a stochastic version of the NCVMP algorithm, scalable to 
large data sets. The data sets that we have considered in this paper were only of 
moderate size. Nevertheless, by applying the stochastic version of the NCVMP 
algorithm in the first few iterations, the time to convergence for these data sets 
was reduced by half or more. We would imagine the gain to be bigger for larger 
data sets and more work remains to be done in that aspect. Experimentation with 
various settings of K and 7 suggest that 7 close to 1 and a large stability constant 
K is preferred in the online setting while mini-batches larger in size perform 
better with larger step-sizes. Comparison of the conflict p- values obtained from 
the NCVMP algorithm with those computed using the approach of Marshall and 



Spiegelhalter (2007) suggest very good agreement. For large data sets, the VMP 
approach will be an extremely attractive alternative to computationally intensive 
MCMC methods in obtaining prior-likelihood diagnostics. All code was written 
in the R language and run on a dual processor Window PC 3GHz workstation. 
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